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\  ABSTRACT 

S  r- 

-We<  considerjthe  generalised  eigenvalue  problem,  (A  - 
and  M  are  large,  sparse,  symmetric  matrices.  For  large  problems 
few  eigenpairs  involves  a  major  computational  task.  In  a 
structural  dynamic  analysis  with  matrices  of  order  8000, 
required  to  compute  50  eigenpairs.  It  is  therefore  interesting  to  examine  the 
advantage  that  vector  computers  such  as  CYBER  205  can  offer.  tr>  f?„  ,  Jy 

f  '  '^r£— We  adopted  our  best  versions  of  the  Subspnee  Herat  km^Method  and  the 
simple  Lancsos  Method  in  order  to  take  advantage  of  thenSml  vector  processor 
of  the  CYBER  205.  Both  techniques  lend  themsefoes  to  vectorixation.  Our 
extensive  comparisons  support  the  following  genend'statements.  Both  methods 
require  the  triangular  factorisation  of  the  same  large  n  X  n  matrix.  This  factori¬ 
zation  dominates  the  total  computation  as  n  -*■  90  provided  that  the  number  of 
wanted  eigenpairs,  p,  remains  fixed  (independent  of  n).  However,  simple  Lancsos 
is  at  least  an  order  of  magnitude  more  efficient  (in  CPU-time)  for  the  remainder 
of  the  computation.  For  /-40,  n— 500  the  factorisation  time  is  not  important 
and  the  full  order  of  magnitude  difference  is  seen  in  the  total  CPU- time.  When 
n»8000  simple  Lancsos  is  only  4  times  faster  than  Subspace  Iteration  on 
the  CYBER  205.  This  confirms  experience  on  serial  computers.  - - - - 


For  problems  that  cannot  fit  into  primary  storage,  input/output  becomes 
increasingly  important.  We  found  that  the  cost  of  input/output  dominated  over 
the  CPU-cost  for  a  problem  that  required  twice  the  available  primary  storage  on 
our  CYBER  205.  However,  this  will  depend  on  the  billing  algorithm  of  the  com¬ 
puter  center.  We  conclude  that  problems  which  have  a  substantial  overhead  in 
reading  and  writing  the  matrices,  should  not  be  solved  by  the  simple  Lancsos 
Method,  but  by  a  Block  Lancsos  Method. 
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1.  Introduction. 

The  pnrpose  of  this  project  was  to  examine  the  performance  of  two  different  methods  for 
the  solution  of  the  eigenvalue  problem 

(A-XM)n«0  (1) 

when  implemented  on  a  CYBER  206  vector  computer.  We  investigated  the  Snbspace  Iteration 
Method  [13],  (I)  (called  SUBIT  hereafter),  which  is  widely  wed  on  traditional  serial  computers  for 
medium  sized  and  large  eigeaptohlama,  and  the  LaBcaw*  Method  [8],  [10]  (called  LANC 
hereafter),  which  recently  hw  been  shown  to  he  an  order  of  magnitude  faster  than  SUBIT  [9].  It 
was  of  interest  to  see  how  the  methods  compare  on  a  vector  computer. 

Standard  in-core  versions  of  the  two  algorithms  were  modifted  to  take  advantage  of  the  spe¬ 
cial  features  of  the  CYBER  206  |4j.  The  algsrith—  were  not  redesigned,  bet  wherever  passible,  a 
CYBER  206  vector  Auction  was  substituted  for  the  original  FORTRAN  code, 

The  tost  prehlema  were  derived  horn  the  dynamic  analysis  of  idealised  3-dimensioaal  struc¬ 
tures,  m  modelled  by  the  Isito  element  program  FEAP  [16,  Ch.  24|.  For  varioes  problem  sixes 
the  two  methods  were  tiamd  extensively,  both  before  and  after  the  explicit  vectoriiation  took 
place. 

tW«t  mad*  pertli  by  a  gnat  at  cifatir  Owe  from  CDC  (gnat  #OSCSU33).  Paitisl  support  by  tbs  U5. 

OSes  at  Nenl  Swrewh  uadsr  ceuM—l  NOOS14.7S-OOQ1*  is  gntsfaly  sdcaeehdgsd. 
fOu  he* e  leu  Remised  Seghael  Ciltg*.  II  IMS  SUnagar,  Nsnrty. 


3.  The  oiturt  of  tho  eigenvalue  probkm. 

We  are  interested  in  the  solution  of  eq.  (1)  when  A  and  M  are  large,  sparse,  n  X  n ,  sym¬ 
metric  matrices.  In  many  applications  the  matrices  have  a  banded  form,  i.e.  a,;  ™  0  for  all 
I  *-j  |  >  where  is  the  (i,/)-element  of  A  and  bu  is  the  half  bandwidth.  For  typical  prob¬ 
lems  in  structural  dynamics  bm  is  (5-10)%  of  n,  cfr.  Section  5.  M  must  be  positive  semi-definite. 
In  some  cases  M  (or  A)  may  be  diagonal,  which  leads  to  a  significant  savings  in  the  computa¬ 
tional  effort  with  either  method. 

“Large”  today  means  n  >  10*,  but  10  years  ago  10*  was  considered  large.  In  these  large 
problems  all  eigenpain  (X,,  x,)  in  a  given  interval  may  be  sought;  these  ate  usually  quite  few, 
perhaps  between  10  to  50.  The  interval  may  be  at  either  end  of  the  eigenspectrum  and  may  con¬ 
tain  the  origin.  For  best  convergence  properties  it  is  best  to  perform  a  shift  of  origin  to  this  inter¬ 
val  before  the  actual  eigen  computation  takes  place.  For  more  details,  see  Sections  4,  4.1,  and  4.2 
as  well  as  (10]. 

S.  Special  features  on  CYBER  MS. 

The  CYBER  206  is  capable  of  attaining  a  rate  of  several  hundred  million  floating  point 
operations  (add  or  multiply)  per  second,  depending  on  the  actual  machine  configuration  and  also 
on  the  precision  of  the  arithmetic.  We  have  used  a  205  with  2  pipes  in  ordinary  single  precision 
(64  bits),  and  an  asymptotic  rate  of  100  megaflops  [4},[7].  One  megaflop  (mflop)  is  a  rate  of  1  mil¬ 
lion  floating  point  operations  per  second. 

3.1.  Veetorlsatlon. 

The  top  performance  can  only  be  obtained  by  those  parts  of  a  program  that  operate  on  vec¬ 
tors,  where  a  “vector”  is  a  set  of  consecutive  memory  ceRi  afl  treated  in  the  same  way. 

In  the  following  ample  example  there  are  three  vectors,  each  corresponding  to  the  n  list 
elements  of  arrays  B,  A,  and  D.  The  vector  in  B  is  the  Schar  product  of  the  vectors  in  A  and  D. 

There  are  see rnti ally  two  ways  to  achieve  vector  performance  for  this  DO-kwp  on  the 
CYBER  205.  The  FORTRAN  code  in  Fig.  1  stay  be  left  unchanged  and  then  be  compiled  with 


DO  100 I-1,N 
B(I)-A<I)*D(I) 

100  CONTINUE 

Fig.  1.  DO- loop  operating  on  vectors  (Schar  product) 

special  vector  optimixers  (“automatic  vectorisatiou”),  or  we  may  replace  the  DO-loop  by  a  direct 
reference  to  the  vector  multiply  instruction,  see  Fig .  2. 

B<1;N)-A(1;N)*D(1;N) 

Fig.*.  Vector  multiply  instruction  corresponding  to  Fig.  1 

In  either  case  the  vector  instruction  is  composed  of  a  start-up  phase  during  which  the  operands 
are  lined  up  and  made  ready,  and  the  actual  execution  phase  with  two  operations  (because  of  two 
pipes)  per  CPU-cycle  (one  CPU-eycle  was  equal  to  20  nanoseconds  on  the  CYBER  205).  Because 
the  unproductive  start-up  phase  has  to  be  amortised  over  all  operations,  the  longer  the  vector 
length,  the  higher  the  performance  rate.  (However,  there  is  a  maximum  allowed  vector  length  of 
05635  elements.)  We  have  timed  the  vector  multiply  instruction  (Fig.  2)  for  various  vector 
lengths,  and  we  have  found  good  agreement  with  the  performance  data  given  by  CDC  [7|;  e.g.  a 
rate  of  50  mflops,  or  half  the  asymptotic  rate,  is  achieved  with  vector  lengths  about  100,  and  80 
mflops  is  reached  when  the  vector  lengths  are  about  400.  The  results  of  our  direct  performance 
measurements  are  presented  in  Appendix  A. 

Consider  next  the  example  in  Fig.  3,  where  there  are  two  input  vectors  and  one  input  scalar 
that  are  to  be  combined  through  an  addition  and  a  multiplication.  We  refer  to  this  as  a  SAXPY 
(single  precision  s  Xs  plus  jr),  and  on  the  CYBER  205  it  may  be  realised  as  one  vector  opera¬ 
tion,  i.e.  after  start-up  two  output  dements  (because  of  two  pipes)  are  computed  per  CPU-cycle. 
As  is  customary  in  numerical  analysis  we  shall  count  the  production  of  one  output  element  as  one 
looting  point  operation.  CDC  uses  a  different  terminology  and  calls  a  SAXPY  a  linked  triad,  and 
also  counts  each  pats  within  a  SAXPY  as  two  looting  point  operations  (mnltiplkation  and  addi¬ 
tion).  With  vector  lengths  100,  we  found  an  effective  rate  of  about  37  mfops;  60  adopt  is 
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reached  for  vector  lengths  about  160,  80  mflops  for  vector  lengths  about  650,  and  the  asymptotic 
value  is  still  100  mflops  (cfr.  Appendix  A). 

DO  200  1—1, M 

Y(I)-Y(I)+A*X(I) 

200  CONTINUE 

Fig.  3.  DO- loop  for  the  SAXPY 

Again,  the  vector  operation  will  be  invoked  if  automatic  vectoriiatioa  is  specifled  during 
compilation,  or  if  explicit  vector  code  is  substituted  in  the  program,  see  Fig.  4. 

Y(1;M)— Y(1;M)+  A*X(1;M) 

Fig.  4.  Vector  code  corresponding  to  Fig.  3 

The  dot  product  of  two  vectors  may  be  coded  as  ia  Fig.  5. 

DOT— 0. 

DO  100 1— 1,N 

DOT— DOT+  R(I)*U(I) 

100  CONTINUE 

Fig.  S.  DO- loop  for  the  dot  product 

The  dot  product  function  on  the  CYBER  205,  Q8SDOT,  is  not  a  vector  function,  but  a  simula¬ 
tion  of  one,  see  Fig.6.  It  has  been  implemented  with  scalar  instructions  ia  a  very  efficient  way. 

DOT— Q8SDOT(R(  1;N),U(1;N)) 

Fig.  3.  “Vector”  dot  product  equivalent  to  Fig.  5 

Q8SDOT  has  a  relatively  slow  start-up  phase,  and  then  performs  oae  partial  product  and 
accumulation  per  cycle.  This  operation  is  unrelated  to  the  pipeline  feature.  On  a  CYBER  205 
with  two  pipelines,  as  waa  the  case  ia  the  preaeat  investigation,  the  dot  product  will  therefore 
only  reach  half  the  speed  of  the  SAXPY  tor  long  vectors.  We  shall  count  one  partial  product  and 
accumulation  as  one  floating  point  operation  (thus  Fip.  5  and  0  each  contains  u  point 


operations).  Ia  oar  direct  measurements  of  the  rector  dot  prodact  (Fig.  6)  we  found  aa  effective 
rate  of  about  20  mflops  with  vector  leagths  100,  aad  about  43  ml! ops  with  vector  ieagths  1000 
(cfr.  Appendix  A).  For  all  vector  lengths  it  is  to  be  seen  from  Appendix  A  that  the  SAXPY  is 
almost  twice  as  fast,  aad  the  Schur  product  more  than  twice  as  fast,  as  the  dot  product. 

The  ratios  of  mllop  rates  ia  Appendix  A  suggest  strongly  that  programs  for  the  CYBER  205 
should  be  written  using  linear  combinations  of  vectors  (SAXPY’s)  rather  than  dot  products.  For 
example,  if  V  is  n  X  m  and  P  is  m  X  m  then  the  product  Z  ■  VP  can  be  computed  either  as  nm 
dot  products  of  length  m  (requires  V  held  by  rows)  or  as  m*  SAXPY’s,  each  of  length  n.  The 
second  form  (see  Fig.  7)  is  dearly  preferable  when  n  >  m. 

DO  200  J— 1,M 

Z(1,J;N)-P(1,J)*V(1,1;N) 

DO  200  I— 2M 

Z(l,  J;N)-Z(1,J;N)+  P(I,  J)*V(UN) 

200  CONTINUE 

Fig.  7.  SAXPY’s  used  for  matrix  multiplication,  n>m 

When  n  <  m,  the  lint  form  (using  dot  products)  may  be  faster  than  Fig.  7,  because  the  vector 
leagths  are  longer.  However,  it  is  more  efficient  to  compute  the  rows  of  Z  as  linear  combinations 
of  the  rows  of  P  (using  SAXPY’s),  although  then  P  aad  Z  should  be  held  by  raws. 

U.  Memory  management. 

The  CYBER  205  has  virtual  memory  (theoretical  upper  bound  2X 10“  words  per  user).  The 
real  memory  on  the  machine  we  used  was  2  miflioa  words,  aad  there  was  a  30.5  Mbit/s  link  to  a 
total  of  450X 10*  words  on  disk.  A  program,  with  instructions  and  data,  will  be  organised  on 
pages,  for  which  there  ate  two  choices,  either  small  pages  (equal  to  512  words)  or  large  pages 
(equal  to  06530  words).  The  paging  system  will  seek  to  keep  the  most  recently  need  pages  in  the 
primary  storage.  When  the  program  references  data  (or  instructions)  that  do  not  at  that  moment 
reside  ia  the  primary  storage,  a  “page  fault’’  occurs.  CYBER  206  will  halt  the  execution  of  the 
program  until  the  page  that  contains  the  requested  data  has  been  transferred  from  the  secondary 
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storage,  usually  at  the  expense  of  another  page  being  pat  oat  to  the  secondary  storage.  While 
this  swapping  takes  place,  the  CYBER  205  may  execute  other  programs  that  are  allowed  to 
occupy  part  of  central  memory.  There  is  a  small  overhead  in  CPU-time  when  a  vector  crosses  a 
page  boundary,  and  also  during  a  page  fault.  When  a  program  references  data  in  an  “orderly" 
fashion,  as  when  consecutive  columns  of  a  matrix  are  used  consecutively  and  not  at  random,  it  is 
more  efficient  to  use  large  pages.  This  is  indeed  the  case  with  both  our  eigenvalue  methods. 

The  accounting  system  assesses  a  cost  penalty  for  a  large  page  fault  of  .150  SBU  (System 
Billing  Units),  equivalent  to  .156  CPU-eeconds  [12].  For  luge  problems  that  cannot  be  fitted  into 
primary  storage,  this  penalty  might  actually  be  larger  than  the  CPU  time  for  the  whole  computa¬ 
tion. 

4.  Description  of  the  two  eigenvalue  methods. 

We  are  interested  in  some  of  the  eigenvalues  closest  to  a  specified  value,  a.  In  each  method 
we  shall  perform  the  same  initial  calculations: 

•  shift  the  A  matrix  by  r.  A  «■  A  -  <rM 

-  factor  X  into  L  A  Lr 

Here  L  is  a  lower  triangular  matrix  with  diagonal  elements  equal  to  one,  A  is  a  diagonal  matrix, 
and  Lr  is  the  transpose  of  L.  Incidentally  the  number  of  negative  elements  in  A  equals  the 
number  of  eigenvalues  that  are  smaller  than  a. 

We  used  a  standard  active  column  profile  solver  (called  PR  OF  EL).  The  upper  triangular 
part  of  X  is  stored  and  gradually  overwritten  with  Lr.  Consider  the  computation  of  a  typical 
columa  of  Lr  of  “height"  k  (above  the  diagonal).  Each  element  will  require  a  dot  product  and 
the  lengths  of  the  vectors  involved  will  vary  horn  1  to  A- 1.  Altogether  nk  dot  products  are 
needed  for  Lr  and  the  average  length  is  i/2,  where  i  is  the  average  half  bandwidth  of  X 

Of  course  PROF1L  could  be  reorganised  to  compute  L  by  columns  and  so  replace  dot  pro 
ducts  by  SAXPTs.  This  helps,  but  aot  much.  The  significant  fact  m  that  k/2  is  small  compared 
to  n  in  most  structural  problems  and  the  factorisation  of  narrow  banded  matrices  cannot  exploit 
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the  fail  rector  mflop  rate  of  the  CYBER  205  since  it  is  manipulating  rectors  of  (arerage)  length 
4  /2  rather  than  n. 

As  we  shall  see  the  factorisation  process  dominates  both  methods  to  a  greater  extent  on  the 
CYBER  205  than  in  serial  machines. 

The  remaining  part  of  either  method,  SUB  IT  or  LANC,  is  formulated  in  such  a  way  as  to 
solre  the  following  transformed  eigenralue  problem 

((LALrrlM  (2) 

The  largest  a’s  correspond  to  the  eigenralues  X  closest  to  o  (these  are  the  smallest  X’s  when  <7— 0) 
according  to 


The  eigenrectors,  x,  in  Eq.  (2)  are  the  same  as  in  Eq.  (1).  See  |5)  for  more  on  this  transformation 
of  the  problem. 

The  following  sections,  4.1  and  4.2,  will  describe  in  detail  the  two  methods.  At  each  stage 
we  emphasize  the  rector  operations. 

4.1.  Subspace  Iteration  (SUBIT). 

The  method  works  with  m  iteration  rectors  at  a  time.  We  say  that  the  subepace  dimension 
is  m.  At  step  4  the  current  set  of  rectors,  held  as  the  columns  of  the  nXm  matrix  &*~l\  is 
replaced  by  another  set,  held  as  the  columns  of  S<*>.  The  columns  of  S***  are  kept  mutually 
orthogonal. 

During  each  major  step  k,  (4—1,2,  •  •  • ),  the  following  tasks  an  performed. 

4.1.1.  M-operatlon. 

(a)  Compute 

B<*>  -  (4) 

This  involves  a  total  of  (24*  +  l)nm  operations,  where  4*  is  the  arerage  half 
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bandwidth  of  M.  For  a  general  M  we  would  have  an  average  vector  length  equal  to 
iy.  But  when  M  it  diagonal,  0  and  only  m  vector  Sehur  product*  of  length  n 
are  performed. 

(b)  Compute  the  upper  triangular  part  of  the  symmetric  m  x  m  projection  matrix 

M**>  -  8<*-l>r  R<1>  (5) 

Here  m*/2  vector  dot  product*  of  leagth  »  are  performed. 

4.1.8.  A-projectlon. 

(a)  Solve  AS^-R^*1  for  S***.  nit  is  done  in  three  phase*: 

(al)  Solve  for  C 

L  C  -  R<*>  (8) 

nm  dot  product*  with  average  vector  length  b. 

(a2)  Compute 

F  -  A"1  C  (7) 

m  vector  Schur  product*  of  leagth  n. 

(a3)  Solve  for  S1** 

Lr  3^-F  (8) 

nm  SAXPY’s  with  average  vector  length  b. 

With  care  C,  F,  and  5(t)  can  share  the  same  storage  area. 

(b)  Compute  the  upper  triangular  part  of  the  symmetric  mXm  projection  matrix 

A<*>  -  r>‘  R<*>  (9) 

m*/ 2  vector  dot  product*  with  vector*  of  leagth  n. 

4.1J.  Small  eignnproMem. 


Solve  the  projected  m  by  m  eigenvalue  problem 

(A<*>-**W*>) 


(10) 


for  all  m  eigenvalues  ^  and  (orthogonal)  eigenvectors  G(t). 

Here  we  transformed  the  problem  to  a  special  eigenvalue  problem  (10,  cb  1S|,  which 
was  solved  by  the  EISPACK  subroutine  EISQL  (6).  The  transformations  involve  an 
mxm  factorisation  and  forward  redaction  and  backward  substitutions;  a  total  of 
4/3m*  operations.  And  it  is  our  experience  on  scalar  computers  that  EISQL  contains 
twice  as  many  operations,  i.e.  8/3 m*.  The  full  eigensolution  thus  represents  some  4m* 
operations.  For  large  typical  eigenproblems  this  is  negligible  compared  to  the  number 
of  operations  that  are  required  in  other  parts  of  the  program.  Nevertheless,  we 
replaced  dot  products  and  SAXPY’s  in  the  transformation  subroutines  with  equivalent 
vector  expressions.  The  EISQL  subroutine  cannot  be  vectorised. 

4.1.4.  Formation  of  naw  basin. 

Compute 

S<»> -**><**>  (11) 

This  can  be  written  as  ma  SAXPY’s  of  length  »,  cfr.  Fig.  7. 

Typically,  both  t  and  m  are  «  n,  and  the  number  of  necessary  iterations,  l,  is  about  20. 
E.g.t  when  n— 2000,  1—100,  then  to  8nd  40  eigenvalues  we  might  use  m— 50,  60,  70,  or  80,  and 
expect  convergence  in  1«15  to  30  iterations. 

Under  these  circumstances  the  dominant  parts  of  the  algorithm,  when  the  M  matrix  is  ding* 
on  si  (as  in  our  test  cases,  cfr.  Section  5),  are 

•  Factorisation 
nt*/2  operations 

•  Linear  operator  (4.1.2  (al),  (a3)) 

!(2!nm)  operations 

New  basis  (4.1.4) 

/(nm*)  operations 
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The  break-even  between  the  factorization  and  the  linear  operator  is  reached  after 
iterations  as  far  as  the  number  of  operations  is  concerned,  bat  becaase  of  the  short  vector  length 
daring  factorization  we  shall  expect  to  need  more  iterations  than  this  on  the  CYBER  205  before 
the  linear  operator  takes  as  mach  CPU-time  as  the  factorization. 

The  computation  of  the  new  basis  will  usually  be  far  less  expensive  than  the  first  two.  m  is 
usually  small  compared  to  2b ,  and  the  vector  length  is  equal  to  the  problem  size  n  in  this  compu¬ 
tation. 

There  is  a  potential  for  reduction  of  the  number  of  operations  in  the  fact  that  the  first 
eigenvalues/eigenvectors  will  converge  rather  quickly.  These  eigenvectors  may  then  be  locked, 
and  not  participate  e.g.  in  the  time-consuming  linear  operator  (Section  4.1.2(a)). 

4.2.  Lanesos’ Method  (LANC). 

We  used  a  simple  Lanczos’  algorithm.  A  single  random  starting  vector,  r<>,  is  iterated 
according  to  the  scheme  presented  in  (9|.  Initialisations  include  setting  q^O,  and  computing 
Po— Mr0  and 

In  each  step  j,  (  ;'«1,2,  •  •  •  ),  the  following  tasks  are  done: 

(a)  Orthogonalization 

r;_j  will  be  orthogonalized  against  the  previous  Lanczos  vectors  when  needed  [15). 

(This  is  called  selective  orthogonalizatioa.) 

A  maximum  of  (j- 2)  SAXPY’s  and  dot  products  with  vector  length  n. 

(b)  Compute  q;— ^  and  . 

P,  P, 

This  is  two  vector  operations,  vector  length  n. 

(c)  Solve  Xr;  for  r; . 

(This  is  equivalent  to  Section  4.1.2(a),  now  with  m«l.) 

-  “solve  (A)",  i.e.  (2*+l)n  operations,  n  dot  products  and  n  SAXPY’s  with 
average  vector  length  equal  to  b,  and  one  vector  Schur  multiply  with  vector 
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length  n . 

(d)  Compute  t,  —  v,-q,-i0,  • 

Single  SAXPY,  vector  length  n. 

(e)  Compute  a;— r/p ;-_j. 

Single  dot  product,  vector  length  n . 

(f)  Compute 

Single  SAXPY,  vector  length  n. 

(g)  Compute  p;=Mr;  and  /J;+1— y/rfPr 

-  “mult  (M)”,  i.e.  a  vector  operation  (Schur  product)  of  length  n  (when  M  is  diag¬ 
onal) 

-  a  dot  product,  vector  length  n 

If  01+l  is  small  compared  to  |<*j  |  and  9,,  (e),  (f),  and  (g)  will  be  repeated  once.  This  is 
found  to  take  place  in  fewer  than  1/4  of  the  steps. 

(h)  Analysis  of  the  symnu'ric  tri-diagonal  matrix  T;  which  has  the  a’s  as  diagonal  ele¬ 
ments  and  the  /7s  as  bidiagonal  elements. 

«80i  scalar  operations  (11). 

(i)  For  converged  eigenvalues  compute  eigenvectors  of  T,  aad  then  compute  eigenvectors 
x. 

j  SAXPY’s  with  vector  length  n  per  computed  x. 

Typically,  the  first  eigenvalue  will  converge  in  5  -  10  iterations,  aad  20  eigenvalues  will  con¬ 
verge  in  40  -  SO  iterations.  For  longer  runs  it  is  a  good  assumption  that  // 2  eigenvalues  will  have 
converged  in  I  iterations. 
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The  operation*  coant  for  a  LANC  run,  when  the  M  matrix  is  diagonal,  cfr.  Section  5, 
includes 

(1)  Factorisation 

r»42/ 2  operations;  dot  products,  average  vector  length  4/2 

(2)  Orthogonalization,  total  for  l  steps 

For  selective  orthogonalization  we  have  found  that  S3n/*/5  operations  are  required; 
equally  divided  between  SAXPY’s  and  dot  products,  each  of  vector  length  n. 

For  a  full  reorthogonalizatioo  n(P-l)  operations  are  required;  equally  divided  between 
SAXPY’s  and  dot  products,  each  of  vector  length  n 

(3)  t  Lanczos’  steps 

(3a)  (24+  l)nf  operations;  dominated  by  SAXPY’s  and  dot  products,  average  vector 
length  4  (solve  (X)) 

(3b)  5/4  nl  operations,  vector  length  n  (molt  (M)) 

(3c)  6 nl  operations;  equally  divided  between  dot  products  and  scalar  vector  products, 
vector  length  n 

(4)  Analysis  of  tri-diagonal  matrix,  T,  total  for  f  steps 
W40/1  operations;  non-vectorizable 

(5)  Computation  of  1/2  eigenvectors 

«3/8  nl3  operations;  estimated  for  the  case  that  one  eigenvalue-/ eigenvector  con¬ 
verges  in  each  of  the  1/2  last  steps.  This  is  an  overestimate,  more  typical  would  be 
1/8  nl*  to  1/4  nf*  operations.  These  operations  am  SAXPY’s  with  vector  length  n. 

A  comparison  of  operation  counts  shows  that  a  LANC  ran  with  fewer  than  4/4  steps  does 
not  permit  the  initial  cost  of  factorisation,  (1)  above,  to  be  amortised.  For  longer  runs  (#>4/4) 
the  Lancsos’  steps  (3)  wiU  require  more  operations  than  the  factorisation,  and  then  (selective) 
orthogonahzatiott  (2)  and  the  analysis  of  T  (4)  become  sigaifcaat  parts  of  a  LANC  step.  Because 
of  the  poor  vector  length  during  factorisation  as  compared  with  the  other  parts,  we  shall  expect 


to  Deed  even  more  steps  than  i/4  to  amortize  the  factorisation  on  the  CYBER  205. 

The  operation  count  break-even  between  parts  (4)  and  (5)  above  is  for  n«=»100,  bat  because 
(5)  is  vectorised,  the  two  modules  will  have  equal  CPU-time  on  the  CYBER  205  for  a  much 
higher  value  of  n .  Note  also  that  (5)  will  always  be  leas  than  half  the  cost  of  (2). 

S.  Test  examples. 

We  used  test  examples  that  typically  occur  in  structural  dynamics.  A  version  of  the  finite 
element  program  PEAP  {16,  Ch.  24]  was  converted  to  run  on  CYBER  205;  this  program  was  then 
used  to  generate  stiffness  matrices,  A,  and  mass  matrices,  M,  for  the  test  examples.  The  eigen¬ 
values,  X,,  are  the  squares  of  the  frequencies  of  free  vibration  of  the  structures,  w,,  i.e.  Xt«w* 

We  generated  a  total  of  4  sets  of  matrices,  with  the  order  of  the  matrices  ranging  from  150 
to  7296  (i.e.  n  »  150  to  7296).  In  all  4  examples  we  chose  to  have  a  diagonal  M.  There  is  no 
loss  of  generality  with  this  assumption.  Our  intention  was  only  to  keep  the  total  computational 
cost  down,  and  yet  be  able  to  examine  large  problems. 

Example  (I) 

n  —  150,  i  —  17 

This  is  a  simple  trass  structure. 

Example  (B) 
it  «•  468,  *  —  60 

This  is  a  structure,  first  presented  in  (2),  which  we  have  used  extensively  during  previous 
testing  |8|. 

Examples  (iii)  and  (hr)  were  generated  with  a  3-dimen sioaal  beam  element  The  model  is  aa 
idealisation  of  a  multistory  structure,  see  Fig.  8.  Each  story  bnd  the  same  geometry,  with 
(/9,-l)JVf  +  elements  parallel  to  ths  *-  and  y-sxea,  and  also  N,N,  elements  paraBst  to 

the  «-axis  (connecting  the  story  to  the  next  lower  story).  Of  ths  total  number  of  nodes,  NtNtNt, 
nfl  that  belonged  to  the  bottom  story  wars  held  fixed.  Each  nods  had  6  variables. 
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Model  used  in  examples  (iii)  and  (hr) 

Example  (111) 

With  Nt  -  Nt  -  4,  N,  -  20  we  get  n  —  1824,  b  -  98. 

Example  (hr) 

With  N,  -  /V,  -  8,  N,  -  20  we  get  n  -  7298,  b  -  370. 

The  model  was  so  constructed  that  whenever  N,  «■  ,  due  to  symmetry  there  wan  a  set  of 

doable  eigenvalues  corresponding  to  the  transverse  vibration  of  the  structure.  Both  Example  (iii) 
and  Example  (iv),  therefore,  contain  doable  eigenvalues. 
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6.  Test  runs. 

While  we  ru  our  4  test  examples  oa  both  SUB  IT  and  LANC,  the  CPU-time  eoasumed  by 
all  major  parts  of  the  programs  was  measured  by  the  CDC  FORTRAN  fuaetiou  SECOND  [3], 
and  we  kept  track  of  the  step  at  which  eigenvalues  were  accepted.  In  SUBIT  we  also  varied  the 
number  of  iteration  vectors,  m.  In  all  cases  a  number  of  the  smallest  eigen v alues/eigeavectors 
were  computed,  i.e.  those  closest  to  the  shift  <t»0. 

7.  Results  and  comparisons. 

7.1.  Effects  of  vectorlsatlon. 

7.1.1  Performance  haprovsasnti  through  vectorisation. 

As  pointed  out  in  Section  3.1  vectoriiatkm  may  be  achieved  through  an  optimisation  option 
for  the  FORTRAN  compiler  or  through  the  replacement  of  instructions  by  explicit  vector  instruc¬ 
tions.  Typically,  the  automatic  vectorisation  will  only  afleet  quite  obviously  vectoriiable  code, 
e.g.  “clean”  DO-loops,  as  in  Figs.  1,  3,  and  5.  More  complicated  sequences  of  instructions  will  not 
be  vectorised  through  the  compiler  option,  although  they  might  well  be  vectorisable.  We  wanted 
to  see  how  much  various  subroutines  improved  from  the  original  scalar  version  to  the  explicit  vec¬ 
torised  version,  and  made  test  runs  under  the  following  4  regimes: 

Condition  (A)s  The  code  was  as  before  vectorisation,  compiled  with  no  optimisation 
options. 

Condition  (B)i  The  code  was  as  before  vectorisation,  and  it  was  compiled  with  the  scalar 
optimisation  option. 

Condition  (C)t  The  code  was  as  before  vectorisation,  and  it  was  compiled  with  scalar  and 
vector  optimisation  options. 

Condition  (D)t  The  code  was  explicitly  vectorised  as  outlined  in  Section  3,  and  it  was 
compiled  with  scalar  and  vector  optimisation  options. 
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First  we  show  the  CPU-times  sad  mflop  rates  for  the  important  factorisation  snbrontine 
(PROFIL)  as  a  function  of  the  problem  sise,  see  Table  1. 

Table  1.  Performance  of  vectorised  factorisation  snbrontine  PROFIL 


Example 

Average 

vector 

length 

Operations 

(millions] 

CPU-time 

[seconds] 

Mlop 

rate 

0) 

0 

.022 

.0148 

1.49 

00 

30 

.84 

.1689 

5.0 

(iii) 

48 

8.23 

1.091 

7.6 

(iv) 

185 

499 

24.59 

20 

Note  that  that  these  rates  are  well  below  the  rates  that  we  have  presented  in  Appendix  A. 
The  reason  is  that  PROFIL  contains  various  conditional  statements  and  also  some  scalar  arith¬ 
metic  (e.g.  on  indices)  in  addition  to  the  vector  operations,  that  will  slow  down  the  code  accord¬ 
ingly.  This  is  also  the  case  for  other  subroutines. 

The  values  in  Table  1  are  for  condition  D,  i.e.  explicitly  vectorised  code.  Table  2  will  show 
how  much  was  gained  in  this  snbrontine  by  the  various  compiler  options. 

Table  S.  Performance  of  factorisation  subroutine  PROFIL 
under  various  optimisation  options 


Mlop 

rates 

Example 

Coed  A 

Coed  B 

Coed  C 

CoadD 

(i) 

.53 

.63 

1.03 

1.49 

(a) 

1.04 

1.64 

3.5 

5.0 

.94 

1.70 

4.4 

7.6 

In  the  following  three  tables  we  shall  see  better  performance,  due  to  longer  vector  lengths. 

The  M-operatioa,  Section  4.1.1(a)  and  (b),  was  coded  as  one  subroutine,  MPROJ.  For  diag¬ 
onal  M  the  effort  is  dominated  by  dot  products  of  length  ».  Table  3  shows  that  the  automatic 
vectorisatioa  worked  almost  as  well  as  our  explicit  veetorisatioa  for  this  subroutine.  Note  also 
Oat,  based  on  the  vector  lengths  given  in  Table  3,  the  performance  rates  for  the  vectorised  dot 
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products  alone  trt  tboat  27,  38,  40,  mod  49.5  mil  ops,  aa  obtained  from  Appendix  A. 
Ttblt  S.  MPROJ  performance  (diagonal  M) 


Example 

Vector 

kaith 

(  Mflop  rate  1 

Cond  A 

Cond  C 

Cond  D 

(0 

150 

1.3 

16 

20 

(») 

468 

1.4 

29 

32 

(iii) 

1824 

1.4 

38 

39 

Jsl _ 

7296 

47 

Note  that  we  could  reformulate  MPROJ  to  nae  SAXPY’a  inatead  of  the  dot  producta,  but 
theae  SAXPY’a  would  uae  much  ahorter  vector  length,  Le.  m,  and  therefore  would  lend  to  poorer 
performance.  From  Appendix  A  we  may  infer  that  with  m-*43  the  mflop  rate  for  the  vectorised 
MPROJ  with  SAXPY’a  would  be  lem  than  20  in  all  4  examples  (the  SAXPY’a  alone  would  per- 
form  at  about  22  mfops,  but  the  additional  work  would  be  at  scalar  speed).  It  is  the  vast 
difference  in  vector  lengths  that  makes  the  dot  products  superior  to  the  SAXPY’s  in  this  subrou¬ 
tine. 

The  formation  of  the  new  basis  in  SUBIT,  Section  4.1.4,  wa  coded  aa  a  subroutine, 
VFORM.  Tabk  4  shows  that  pure  SAXPY  expressions  with  long  vectors  are  indeed  very 
efficient.  If  VFORM  were  coded  using  dot  products  instead  of  SAXPY’s,  then  the  mffop  rate 
would  be  about  half  (and  the  CPU-time  about  twice)  the  values  given  in  Table  4. 

Table  4.  VFORM  performance,  Example  (hr) 


!  Analysis  of  list  step,  rn-23  and  m— 63  | 

m 

Vector 

Operations 

CPU-time 

Mflop 

length 

[millioasl 

[seconds] 

rate 

23 

7296 

3.86 

.0402 

96 

63 

7296 

28.96 

.3016 

96 

b  LANC  be  orthogonal atioa  tab,  Section  4.2(a),  may  be  coded  an  a  full  Gram-Schmidt 
orthogonaffxation,  as  in  our  subroutine  GSORT,  or  as  a  selective  orthofoaaHsatioa,  aa  b  our 
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subroutine  PRORT.  W«  shall  com  meat  oa  th*  choice  of  method  ia  Sectioa  7.3,  aad  at  tkia  poiat 
iaclade  performance  data  for  GSORT,  see  Table  5. 


Table  S.  Orthogoaalixatioa  performance  GSORT 


Example 

Vector 

I  Mlop  rate  | 

length 

Cond  A 

Cond  B 

ComlC 

CoudD 

(*) 

ISO 

1.0 

2.1 

13 

(ii) 

408 

2.0 

2.2 

S.1 

20 

(iii) 

1824 

2.0 

5.3 

42 

(*) 

7208 

50 

GSORT  contains  equal  amounts  of  SAXPY’s  aad  dot  products,  all  with  vector  length  n;  we 
should  therefore  expect  improved  performance  compared  with  MPROJ  in  Table  3.  It  is  surprising 
that  we  did  not  lad  this  to  be  the  case  in  Examples  (i)  aad  (ii).  Table  S  also  illustrates  that  the 
automatic  vectoriiatioa  did  not  vectorise  perfectly  vectorixable  code  (low  performance  for  condi¬ 
tion  C);  this  is  because  the  original  GSORT  contained  an  external  reference  to  a  general  purpose 
dot  product  function,  which  was  coded  in  a  way  the  automatic  vectoriser  did  not  recognise. 

7.1.3.  Thu  Importance  of  vector  length. 

So  far  we  have  aeea  that  the  performance  of  a  given  subroutine  improves  dramatically  with 
longer  vectors.  However,  this  improvement  is  not  as  big  as  the  direct  measurements  of  pure  vector 
codes  indicate.  For  subroutines  like  MPROJ,  VFORM,  aad  GSORT,  that  consist  almost  com¬ 
pletely  of  vector  iaatruetioas,  we  have  observed  a  markedly  lower  performance  than  that 
presented  ia  Appendix  A.  Typically,  n  subroutine  -  or  n  whole  program  for  that  matter  •  will 
iaclade  slow  parts  that  will  degrade  the  overall  performance  even  further.  Scalar  operations,  dot 
products,  and/or  operations  with  short  vector  lengths  are  examples  of  such  slow  parts.  In  order 
for  the  slow  parts  to  have  any  lignite  sat  ia  lienee  on  the  overall  performance,  however,  they 
mast  rupreseat  a  significant  fraction  of  the  total  work.  Below  we  shall  give  n  detailed  discussion 
of  two  Important  subroutines  that  bars  certain  operations  on  short  vectors  and  othar  operations 
on  long  vectors.  The  overall  performance  wil  ho  inbstwssn  whnt  would  have  boon  expected  for 
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the  short  rector  leagth  alone  sad  for  the  long  rector  length  alone. 

Within  SUB  IT  the  A-projection,  Section  4.1.2(al),  (m2),  (m3),  and  (b),  was  coded  as  one  sub¬ 
routine,  APROJ.  The  rector  lengths  in  APROJ  are  both  1  (on  arerage),  daring  4.1.2(al)  and 
(a3),  and  n,  daring  4.1.2(a2)  and  (b),  so  that  for  every  2nm  vector  operations  with  length  i  there 
are  (m*/2  +  m)  rector  operations  with  length  n.  Most  of  the  long  rector  operations  are  dot  pro¬ 
ducts  (m*/2  as  opposed  to  m  SAXPY’s).  Fran  Appendix  A  we  may  see  that  for  large  examples 
(i.e.  large  n )  the  mil  op  rate  over  the  long  rectors  will  be  aboat  40  to  50,  bat  never  significantly 
over  50.  Half  of  the  short  rector  operations  are  efficient  SAXPY’s.  When  the  short  rector  length 
t  is  greater  than  about  120,  these  SAXPY’s  will  also  execute  at  mflop  rates  of  aboat  40  or  more. 
The  other  half  of  the  short  rector  operations  are  dot  products  which  are  significantly  slower  than 
toe  SAXPY’s;  e.g.  with  rector  length  120  the  dot  product  mflop  rate  is  about  22. 

To  farther  illustrate  the  performance  of  APROJ  we  include  some  detailed  timings  of  one 
iteration  step  of  our  largest  example,  see  Table  6. 


Table  0.  Analysis  of  APROJ  rector  performance,  Example  (iv) 


|  Breakdown  of  first  iteration  step,  subspace  sise  ■  23 

Task 

Arerage 

Operations 

CPU-time 

Mflop 

rector  length 

Imillioas] 

(seconds) 

rate 

4.1.2(al) 

370 

82.1 

1.850 

34 

4.1.2(a2) 

7290 

.1878 

.00171 

98 

4.1.2(a3) 

370 

82.1 

1.067 

58 

4.1.2(b) 

7298 

1.93 

.0414 

47 

OVERALL 

128.3 

2.96 

43 

Ignoring  for  a  moment  the  slow-down  effect  of  n on-vectorised  parts  of  APROJ  we  may 
extrapolate  from  Appendix  A  that  the  overall  mflop  rate  for  this  subroutine  cannot  exceed  87, 
and  the  short  rector  length  (5)  has  to  be  larger  than  500  for  the  overall  mflop  rate  to  be  better 


than  50. 


The  subspace  site,  m,  win  not  greatly  influence  the  performance  rate,  but  of  course  the 
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CPU-time  for  tasks  4.1.2(al),  (a2),  and  (a3)  will  increase  linearly  with  m,  and  the  CPU-time  for 
task  4.1.2(b)  will  increase  quadratics!!?  with  m.  Table  7  contains  a  summary  of  overall  perfor¬ 
mance  rates  for  APROJ  for  all  oar  examples. 


Table  7.  APROJ  performance,  vectorised  code 


Example 

Vector 

lengths 

Subspace 

sue 

Mlop 

rate 

0) 

17/150 

23 

6.1 

(0 

17/150 

43 

7.2 

(ii) 

00/468 

23 

15.4 

(ii) 

60/468 

43 

16.2 

(ii) 

60/468 

63 

16.9 

(iii) 

95/1824 

23 

21.2 

(iii) 

95/1824 

43 

21.9 

(iii) 

95/1824 

63 

22.4 

H 

370/7296 

23 

42.4 

(hr) 

370/7296 

43 

42.6 

370/7296 

63 

42.0 

Within  LANC,  see  Section  4.2,  the  tasks  (b)  -  (g)  were  coded  as  one  subroutine,  LANSIM. 
Daring  task  (c)  there  are  vectors  of  average  length  #,  bat  in  the  other  tasks  the  vector  length  is 
eqaal  to  n.  In  an  average  step  there  are  about  9  vector  operations  of  length  s  (0.5  (fast)  vector 
multiply’*  or  SAXPY’s,  and  2.5  (slow)  dot  products)  as  opposed  to  2a  vector  operations  (n 
SAXPY’s  and  n  dot  products)  with  average  vector  length  i.  Thus  the  shorter  vector  length 
occurs  much  more  often  than  the  longer  vector  length  in  typical  examples,  e.g.  100  times  more 
often  in  Example  (ii),  and  nearly  400  times  more  often  in  Example  (iii).  As  a  result  the  overall 
mflop  rate  for  LANSIM  will  be  determined  by  the  shorter  vector  length,  and  we  have  in  fact  a 
situation  that  is  quite  similar  to  what  we  have  described  above  for  the  APROJ  subroutine  in 
SUBIT.  Again  based  on  data  in  Appendix  A  the  problem  must  be  very  large,  with  short  vector 
length  >  500  (if  we  ignore  the  effect  of  noo-vectoraed  parts),  to  give  an  overall  mlop  rate  in 
LANSIM  that  is  higher  than  50.  Table  8  shows  the  performance  that  wo  measured  for  LANSIM 
for  our  more  typical  examples. 


Table  8.  Laaeiot'  step  performance  LANSIM 


Example 


Operations 

[millions] 

CPU-time 

[seconds] 

Mflop 

rate 

.507 

.1183 

4.3 

6.00 

.407 

12 

36.3 

2.08 

17 

546 

13.66 

40 

7.1.8.  Module*  that  were  not  vector laed. 

The  eolation  of  the  small  eigeaproblem  in  SUBIT,  Section  4.1.3,  was  written  an  one  subrou¬ 
tine,  GEIG,  that  called  the  set  of  transformatioa  subroutines  and  also  E1SQL.  The  transforma¬ 
tion  modules,  representing  about  1/3  of  the  operations,  were  vectorised  with  vector  lengths  vary¬ 
ing  from  1  to  m  (m/3  on  average).  According  to  Appendix  A  dot  products  and  SAXPY’s  perform 
equally  well  on  CYBER  205  for  the  vector  lengths  that  we  used  in  these  transformatioa  modules 
in  our  test  runs  (with  m  *■  23,  43,  and  63).  As  a  result  of  vectorisstioa  of  the  transformation 
modules,  EISQL's  fraction  of  the  total  time  in  GEIG  was  found  to  increase  from  64%  (scalar)  to 
74%  (subspace  sue  m— 23),  80%  (m—43),  and  84%  (m*MJ3).  The  performance  rate  of  GEIG 
came  out  between  1.0  mflope  (m— 23)  and  2.0  mflops  (m— 63). 

Within  L.ANC  them  is  also  an  unvectorised  subroutine,  ANALZT,  that  performs  the 
analysis  of  the  tridiagoaal  matrices,  Section  4.2(h).  The  operations  count  for  ANALZT  that  we 
have  given  them,  is  quite  approximate.  Taken  literally  it  indicates  a  performance  rate  of  about  .0 
•  1.5  mflops.  We  recorded  an  improvement  of  about  10%  for  this  subroutine  through  the  scalar 
optimisation  option  for  the  compiler. 


7J,  Overall  SUBIT  performance. 

Because  the  performance  rate  is  diflemat  far  the  diferent  modules  of  SUBIT  it  is  of  interest 
to  look  at  the  overall  performance  of  the  method. 
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As  explained  in  previous  sections  the  SUB  IT  step  consists  mainly  of  4  subroutines:  MPROJ 
(Section  4.1.1),  APROJ  (Section  4.1.2),  GEIG  (Section  4.1.3),  and  VFORM  (Section  4.1.4).  With 
subspace  size  m  ■■  23  the  slowest  of  these  modules,  GEIG,  represents  from  14.7%  (in  Example 
(i))  to  .04%  (in  Example  (iv))  of  the  number  of  operations  in  each  SUBIT  step,  but  this 
amounts  to  from  45.9%  (Example  (i))  to  .8%  (Example  (iv))  of  the  CYBER  205  CPU-time  in 
each  SUBIT  step,  as  shown  in  Table  9. 


Table  9.  Floating  point  operations  and  CPU-times 
for  modules  in  a  SUBIT  step 
(Given  as  percentages  of  whole  step] 


Example 

m 

I  MPROJ 

APROJ 

GEIG 

VFORM  ! 

ops 

time 

ope 

time 

ops 

time 

ops 

time 

(') 

a— 150 

23 

13.0 

3.3 

48.3 

46.6 

14.7 

45.9 

23.9 

4.2 

b-17 

43 

13.1 

3.2 

33.0 

26.5 

28.8 

66.0 

25.1 

4.3 

(ii) 

23 

7.3 

3.0 

76.8 

73.1 

2.6 

20.7 

13.3 

3.2 

n-468 

43 

10.1 

3.7 

63.7 

53.1 

7.1 

38.9 

19.2 

4.3 

b— 60 

03 

11.5 

3.9 

54.1 

39.6 

12.0 

52.0 

22.3 

4.5 

(Ui) 

23 

5.2 

2.6 

84.3 

89.1 

.6 

8.7 

9.6 

2.5 

a— 1824 

43 

8.0 

4.1 

75.3 

79.0 

1.4 

12.9 

15.3 

4.1 

b-95 

63 

9.9 

5.0 

68.1 

69.3 

2.7 

20.6 

19.3 

5.1 

(iv) 

23 

1.6 

1.4 

96.5 

96.4 

.04 

.8 

2.9 

1.3 

n— 7296 

43 

2.7 

2.5 

92.0 

98.2 

.12 

2.0 

5.2 

2  a 

b— 370 

63 

3.7 

3.3 

88.8 

89.9 

.25 

3.6 

7.2 

3.2 

Table  9  clearly  shows  how  dominating  APROJ  is,  and  increasingly  so  with  increasing  prob¬ 
lem  size  n,  both  in  terms  of  number  of  operations  and  CPU-time.  (Should  M  not  be  diagonal, 
but  have  a  banded  form  similar  to  A,  then  MPROJ  should  be  expected  to  take  jest  as  many 
operations  and  as  much  CPU-time  as  APROJ.) 

The  maxim  that  GEIG  represents  a  negligible  effort  for  medium  size  and  large  problems  is 
no  longer  tree  on  vector  computers.  From  a  coarse  interpolation  between  the  values  of  Table  9 
we  have  found  that  while  GEIG  has  leas  than  10%  of  the  number  of  operations  in  a  SUBIT  step 
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for  problem  sise  n  larger  tbaa  about  200  (m*23)  to  500  (maB63),  its  fraction  of  tbe  step'*  CPU¬ 
time  is  still  over  10%  for  much  larger  problems,  until  n  is  abont  1000  (m» 23)  to  4000  (m— 63). 


Table  10.  Floating  point  operations  sad  CPU-times 
Initial  PROFIL  vs.  one  SUB  IT  step 


Example 

m 

1  PROFIL 

SUBH 

'  step 

ops 

(millions! 

CPU-time 

(seconds] 

ops 

[milHonsI 

CPU-time 

[seconds! 

(0 

n— 150 

23 

.022 

.0148 

.331 

.0562 

b-17 

43 

1.104 

.1014 

(ii) 

23 

.842 

.1680 

1.850 

.1263 

n— 468 

43 

4.504 

.3327 

b— 60 

63 

8.311 

.6720 

(Hi) 

23 

8.231 

1.001 

10.033 

.4403 

tt— 1824 

43 

22.12 

.0647 

bH)5 

63 

37.54 

1.645 

(W) 

23 

400.4 

24.50 

132.3 

3.087 

n— 7296 

43 

260.1 

6.027 

b— 370 

63 

400.0 

0.411 

In  small  problems  the  initial  factorisation  (done  by  PROFIL)  will  be  negligible  compared  to 
a  SUB  IT  step  in  terms  of  number  of  operations  as  well  as  CPU-time.  It  is  only  in  our  largest 
example.  Example  (iv),  that  we  flad  PROFIL  to  represent  a  significant  portion  of  a  typical  run's 
number  of  operations  and  CPU-time,  as  is  clearly  seen  from  Table  10.  Considering  the  fact  that 
we  usually  will  perform  more  than  10  steps  of  SUB  IT,  we  therefore  lad  this  method  relatively 
insensitive  to  bow  efficiently  the  factorisation  is  done,  unless  the  problem  is  very  large  and  few 
eigenvalues  are  sought. 

Our  SUBIT  program  "accepts”  an  eigenvalue  (in  the  sense  that  it  will  be  counted  as  a  con¬ 
verged  eigenvalue)  when  the  present  approximation  to  the  eigenvalue  is  closer  to  the  previous 
approximation  (i.e.  horn  previous  step)  than  10"*.  The  last  few  eigenvalues  that  are  accepted 
might  therefore  sot  be  very  accurate,  bat  they  will  continue  to  improve  is  the  next  iterations. 
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YVhes  the  program  finishes  it  ia  generally  the  eaee  that  more  than  80%  of  the  accepted  eigen¬ 
values  have  been  found  within  1<T12  of  the  last  previous  approximation.  If  we  are  looking  for  a 
fixed  number  (p )  of  eigenvalues,  the  number  of  iterations  that  are  required  seems  to  be  indepen¬ 
dent  of  problem  size  (n),  but  it  is  quite  dependent  on  the  number  of  iteration  vectors  (m).  The 
following  Table  11  summarizes  our  convergence  experience  for  the  4  test  examples  using  different 
subepace  sizes.  Both  in  Example  (iii)  and  in  Example  (iv)  to  find  30  eigenvalues  it  takes  20  itera¬ 
tions  with  m»«43,  but  15  iterations  with  m— 63. 

Table  11.  Number  of  converged  eigenvalues  in  SUBIT 


Subspace 

size 

No.  of 
steps 

n-150 

n— 468 

nai824 

n— 7296 

m— 23 

5 

2 

1 

3 

3 

10 

9 

12 

7 

11 

15 

13 

15 

12 

14 

m—43 

5 

3 

7 

3 

3 

10 

13 

25 

20 

14 

15 

16 

32 

26 

21 

20 

25 

31 

30 

25 

30 

31 

31 

30 

30 

m— 63 

5 

8 

5 

3 

10 

32 

20 

15 

15 

42 

31 

32 

20 

44 

38 

42 

25 

47 

47 

42 

30 

47 

47 

50 

Our  results  indicate  that  the  formula  m—min(2p,2p  + 8),  that  is  often  used  to  determine 
the  number  of  iteration  vectors,  ieeds  to  a  subspace  size  that  is  smaller  thaa  the  optimum  size,  at 
least  when  p>15. 

74.  Overall  LANC  performance. 

A  LANC  run  consists  mainly  of  the  following  subroutines:  PROFtt,  and  STPONE  (Section 
4.2,  initialisations),  performed  once;  and  GSORT  (Section  4.2(a)),  LANSIM  (Section  4.2(b)-(g)), 
ANALZT  (Section  4.2(h)),  and  VECT  (Section  4.2(1)),  performed  repeatedly.  GSORT,  LANSIM, 
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and  ANALZT  are  performed  in  each  step;  VECT  only  in  those  steps  when  an  eigenvalue  has  eon* 
verged  and  its  eigenvector  is  to  be  computed. 

Our  intention  was  to  run  LANC  with  our  latest  version  of  selective  orthogoaahzatiou.  How* 
ever,  our  selective  code,  PRORT,  which  worked  perfectly  on  our  serial  computer,  began  to  mis* 
behave  on  the  CYBER  205  on  the  large  example  (Example  (iv)).  For  this  reason  we  switched  to 
fall  reorthogonalisstion,  GSORT,  for  all  our  tests.  Separately  we  compared  the  two  techniques 
on  Examples  (i),  (ii),  and  (iii)  and  found  that  PRORT  required  about  1/3  the  CPU-time  of 
GSORT.  As  we  shall  see  (e.g.  Table  12)  this  is  not  a  significant  advantage  for  PRORT  since 
GSORT  vectorises  so  well  and  is  a  minor  part  of  the  LANC  loop.  This  is  in  strong  contrast  to 
the  situation  with  serial  computers  in  which  full  orthogonal izati on  comes  to  dominate  unless  the 
LANC  runs  are  kept  short. 

Because  all  of  GSORT,  ANALZT,  and  VECT  gradually  involve  more  operations  as  the 
iteration  progresses,  whereas  LANSIM  has  the  same  number  of  operations  in  every  step,  the  tela* 
tive  importance  of  GSORT,  ANALZT,  and  VECT  will  increase  with  the  number  of  iterations. 

Table  12  gives  cumulative  CPU-times  for  the  subroutines  within  the  LANC  loop  after  seme 
typical  number  of  steps,  l.  The  number  of  eigenvalues,  c,  that  have  converged  to  machine  preci¬ 
sion  (1.42108X  10-14)  is  also  given.  The  VECT  column  contains  estimates  on  the  CPU-time  for 
the  eigenvector  computation,  Section  4.2(5).  The  estimates  are  taken  to  be  1/2  of  the  CPU-time 
measured  for  GSORT;  this  is  most  certainly  an  over-estimation.  The  column  for  ANALZT  in  the 
largest  example  alto  contains  estimates,  taken  to  be  the  time  for  40 f*  operations  at  1  mflops.  The 
results  am  presented  separately  for  Examples  (i),  (ii),  (iii),  and  (hr). 

In  small  problems  the  unvectorised  ANALZT  will  take  a  significant  portion  of  the  CPU-time 
in  the  LANC  loop.  Even  in  Example  (IB)  ANALZT  takes  about  as  much  time  as  the  orthogonaii- 
satioa  subroutine,  GSORT,  while  them  am  about  45  times  as  many  operations  in  GSORT.  After 
44  LANC  steps  (Example  (iii))  ANALZT  km  consumed  about  7%  of  the  LANC  loop  time;  its 
sham  rises  to  about  15%  after  100  steps,  and  it  wiB  eventually  take  mom  time  than  LANSIM 
(after  some  400  steps).  If  on  n  scalar  computer  we  can  achieve  1  mfiops  for  all  operations,  we  can 


Table  IS.  Cumulative  CPU-time*  for  lubroutines  in  the  LANC  loop 
All  times  are  in  seconds 


Example 

1 

c 

GSORT 

LANSIM 

ANALZT 

VECT 

(i) 

10 

3 

.0020 

.0135 

.0070 

(.0010) 

n  -  150 

20 

5 

.0062 

.0287 

.0177 

(.0031) 

b-17 

40 

17 

.0282 

.0613 

.0623 

(.0141) 

80 

40  i 

.0771 

.1183 

.2575 

(.0386) 

(ii) 

10 

1 

.0013 

.0453 

.0060 

(.0007) 

n  —  488 

20 

5 

.0058 

.0963 

.0182 

(.0029) 

b— 60 

40 

15 

.024 

.196 

.054 

(.012) 

80 

37 

.099 

.396 

.188 

(.050) 

100 

54 

.156 

.497 

.271 

(.078)  | 

150 

69 

.352 

.748 

.545 

(.176) 

(«i) 

10 

2 

.0034 

.1908 

.0074 

(.0017) 

n  —  1824 

20 

5 

.0153 

.4053 

.0197 

(.0077) 

b— 95 

44 

20 

.081 

.987 

.082 

(.040) 

100 

46 

.418 

2.080 

.362 

(.209) 

(hr) 

10 

2 

.0115 

1.241 

(.0040) 

(.0058) 

a  »  7296 

20  | 

3 

.053 

2.621 

(.016) 

(.027) 

b— 370 

40 

17 

.226 

5.382 

(.064) 

(.113) 

60 

26 

.520 

8.142 

(.144) 

(.260) 

80 

59 

.934 

10.902 

(.256) 

(.467) 

100 

72 

1.469 

13.664 

(•<”> 

_LZ35) 

expect  ANALZT  to  catch  up  with  LANSIM  at  a  rate  that  is  about  17  times  slower  than  this,  cfr. 
Table  8.  GSORT  would  have  been  slowed  down  by  a  factor  of  42,  and  would  have  taken  about 
half  of  the  time  taken  by  LANSIM  after  100  LANC  steps  with  Example  (iti)  oa  this  fictitious 
computer. 

The  full  LANC  run  also  includes  initialisations,  most  of  which  is  the  factorisation  done  by 
PROFIL.  Table  13  shows  that  the  inititialisation  cost  may  be  quite  substantial  compared  to  the 
cost  of  the  LANC  loop;  e.g.  in  Example  (iii)  first  after  some  00  steps  does  the  LANC  loop  take 
more  CPU-time  than  the  initialisation,  and  in  Example  (hr)  this  does  not  happen  in  100  steps. 

For  this  method  as  well  the  number  of  converged  eigenvalues  seems  to  be  independent  of 


problem  sisc  (n).  For  all  ex  am  plea  about  30  steps  were  necessary  for  the  first  10  eigenvalues  to 
converge.  Table  14  shows  the  exact  step  when  each  of  the  10  list  eigenvalues  converged  to 
machine  precision.  However,  the  eigenvalues  will  not  necessarily  be  found  in  strictly  inereasing 


27 


Table  IS.  laitialixatioa  vs.  cnmelaihre  LANC  steps 
Floating  point  operations  sad  CPU-time 


Example 

No. 

Initialisation 

LANC  steps 

of 

ope 

CPU-time 

ops 

CPU-time 

step* 

(millionsl 

[seconds] 

[millionsl 

(seconds) 

(i) 

10 

.027 

.0211 

.090 

.0235 

a  -  ISO 

20 

.233 

.0557 

b— 17 

40 

.678 

.166 

80 

2.20 

.492 

(H) 

10 

.900 

.193 

.675 

.0533 

n  —  468 

20 

1.50 

.159 

b— 60 

40 

3.59 

.286 

80 

9.55 

.733 

100 

13.42 

1.00 

150 

25.7 

1.82 

(iii) 

10 

8.58 

1.50 

3.90 

.204 

n  -  1824 

20 

8.34 

.448 

b— 95 

44 

21.3 

1.19 

100 

63.9 

2.44 

(W) 

10 

505 

26.0 

55.7 

1.26 

n  -  7296 

20 

114 

2.72 

b— 370 

40 

236 

5.79 

367 

9.07 

507 

12.56 

655 

16.3 

Table  14.  Initial  convergence  in  LANC 


Eigenvalue 

no. 

I  Iteration  no.  when  eigenvalue  converges  I 

mu 

■u 

ss 

Example 

(Si) 

1 

6 

9 

6 

5 

2 

8 

16 

9 

10 

3 

9 

17 

11 

12 

4 

15 

18 

12 

21 

5 

18 

20 

18 

21 

6 

21 

22 

21 

22 

7 

21 

23 

26 

22 

8 

24 

24 

27 

24 

9 

25 

27 

27 

25 

10 
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order.  This  ia  particularly  the  caae  when  there  are  doable  eigenvalues,  as  in  Example  (ill)  and 
Example  (iv).  E.g.  ia  Example  (iii),  when  30  eigenvalues  have  converged,  they  are  ia  fact  found 
to  be  all  of  X|  to  X*.  then  Xa  through  Xjq,  and  XM.  The  “missing"  eigenvalue,  X0,  ia  equal  to 
XM,  and  it  was  found  ia  a  few  more  steps  (along  with  Xtl>  X**  X*,  and  X0).  In  practice  it  is  teas* 
swing  to  do  one  extra  factorisation  simply  to  check  that  there  are  no  missing  eigenvalues  among 
the  cnes  which  have  been  computed.  This  represents  an  added  cost  of  n**/2  operations;  that  has 
not  been  included  ia  our  tables  or  figures. 

7.4.  Comparisons. 

Examples  (i),  (ii),  and  (iii)  could  be  solved  within  the  available  CYBER  205  primary 
storage,  and  a  comparison  of  the  two  methods  may  then  justly  consider  only  the  CPU-time  that 
was  needed  for  the  computation.  Fig.  0,  10,  and  11  show  a  graph  of  the  total  CPU-time  vs.  the 
number  of  converged  eigenvalues  for  these  examples. 


F!«.  0.  SUBfr  vs.  LANC,  Example  (i),  *-150 


Note  that  the  CPU-times  for  LANC  have  been  multiplied  by  10  in  Figs.  9, 10,  and  11.  We 
feel  it  fair  to  say  that  LANC  is  an  order  of  magnitude  faster  than  SUBIT  on  these  examples,  but 
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Pig.  10.  SUBIT  va.  LANC,  Example  (ii),  n-468 


N».  ft 


Pig.  11.  SUBIT  va.  LANC,  Example  (Ui),  a -1824 

the  tread  ie  for  SUBIT  to  gala  relmtire  to  LANC  aa  tke  problem  die  n  iaereaeee.  Thia  la  of 
comae  dee  to  tke  fact  that  tke  coat  of  tke  iaitial  factorisatioe  la  mere  domiaaat  ia  LANC.  It  la 
atao  to  be  aeea  tkat  SUBIT  ia  at  ita  beat  wbea  eery  few  ekeapain  are  aoaxkt;  tbea  a  email 
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number  of  iteration  vectors  (m)  is  sufficient.  LANC  is  definitely  wperior  also  in  these  cases, 
although  not  by  an  order  of  magnitnde. 

Note  also  that  we  are  comparing  the  computation  of  the  same  flxod  number  of  eigenpairs, 
up  to  about  50,  irrespective  of  problem  site  it.  But  it  may  be  more  common  to  look  for  more 
eigenvalues  in  a  larger  example  than  in  a  small  one.  From  the  tendencies  that  we  have  observed, 
e.g.  in  Figs.  0  -  11,  we  may  argue  that  when  we  are  looking  for  a  number  of  eigen  pain  tha£  is  the 
same  fixed  fraction  of  the  problem  sise,  SUB1T  is  no  longer  seen  to  gain  relative  to  LANC  with 
increasing  problem  sise.  However,  we  do  not  want  to  stretch  this  argument  too  far,  because  tbe 
best  way  to  compute  a  large  number  of  eigea pairs  (with  either  method)  will  ordinarily  include  the 
use  of  repeated  shifts  and  factorisations,  which  is  not  being  discussed  in  the  present  report. 

Our  largest  example.  Example  (hr),  required  almost  3  million  words  to  store  the  A  (or  L) 
matrix.  Because  a  typical  step  both  in  SUBIT  and  in  LANC  involves  2  passes  through  the  L 
matrix  and  there  are  leas  than  2  million  words  of  primary  storage,  it  is  evident  that  extensive 
swapping  did  take  place  during  the  course  of  the  programs.  Although  it  is  possible  to  insert  com* 
mands  ia  the  program  that  may  advise  tbe  paging  system  about  pages  that  can  be  removed  from 
the  primary  storage  and  pages  that  are  needed  ia  the  primary  storage,  we  relied  solely  oa  the 
standard  scheduling. 

Fig.  12  shows  the  comparison  of  SUBIT  aad  LANC  as  far  as  CPU-time  is  concerned  for 
Example  (hr).  To  compute  30  eigeapairs  ia  this  case  required  about  4  times  as  much  CPU-time  in 
SUBIT  as  ia  LANC. 

Table  15  summarises  some  typical  runs  of  Example  (iv)  aad  compares  the  cost  of 
ia  put/output  to  the  cost  of  computation.  I  is  tbe  number  of  iteration  steps,  m  is  tbe  subspace 
sise  (ia  SUBIT),  LPF  is  the  number  of  large  page  faults,  "Penalty"  is  the  SBU  corresponding  to 
the  large  page  faults,  aad  "Comp.time"  is  the  total  CPUtime  for  the  computation.  The  load 
module  for  both  programs  was  about  61  large  pages. 

Whet  ire  are  searching  for  quite  few  eigenvalues,  SUBIT  may  be  run  with  a  small  subspace 
(m)  aad  get  away  with  fewer  page  faults  per  iteratioa  step.  Ia  LANC  it  is  seea  that  the  number 
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pt|«  faults  is  larger  thaa  the  CPU-time  for  tke  computation.  (Aad  when  M  is  not  diagonal,  we 
•hall  expect  a  substantial  increase  ia  tke  limber  of  page  finite,  became  we  shall  also  need  to 
make  a  pus  throagh  M.)  Because  SUBIT  work!  on  several  vectors  at  tke  same  time  aad  con¬ 
verges  to  a  required  number  of  eigenvalues  aad  eigenvectors  ia  fewer  steps  thaa  does  our  simple 
LANC,  SUBIT  may  perform  better  thaa  (simple)  LANC  ia  cases  like  Example  (iv),  where  the 
problem  is  too  big  for  the  primary  storge  and  extensive  paging  takes  place. 

It  is  possible  to  work  with  more  thaa  one  vector  ia  each  step  of  the  Laacsos  method,  using 
Block  Laacsos  [14]  (called  BLANC  hereafter).  When  the  block  sixe  is  •  (i.e.  t  vectors),  each 
BLANC  step  will  involve  about  *  times  as  many  operations  aad  i  times  as  much  CPU-time  as  a 
step  of  LANC.  At  the  same  time  the  number  of  steps  required  for  a  certain  number  of  eigeupairs 
to  converge  in  BLANC  will  be  only  a  fraction  (roughly  1/r)  of  that  in  LANC,  at  least  for  fairly 
long  runs.  The  net  effect  is  that  the  computation’s  total  CPU-time  will  remain  about  the  sum. 
However,  the  overall  number  of  page  faults  with  BLANC  will  be  greatly  reduced  (by  a  factor  of 
«)  compared  to  LANC.  This  is  because  still  only  two  passes  through  L  ate  needed  in  each  step  of 
the  algorithm.  For  the  example  ia  Table  15  it  is  sees  that  already  with  *—  2  BLANC  would  have 
had  page  fault  penalties  comparable  with  SUBIT,  and  a  larger  block  site  (•)  would  have  reduced 
the  number  of  page  faults  even  farther. 

When  BLANC  is  used  for  Indiag  very  few  eigen  pairs  and/or  with  a  large  block  sire,  the 
number  of  steps  is  reduced  by  a  factor  that  is  leas  thaa  >,  aad  tke  total  CPU-time  will  increase 
compared  with  simple  LANC.  Further,  the  saving  ia  the  number  of  page  faults  will  be  smaller 
because  of  the  increase  ia  the  number  of  steps. 

We  conclude  that  for  problems  that  are  too  big  for  the  primary  storage,  BLANC  would  be 
more  efficient  than  LANC  because  of  tke  smaller  penalties  for  page  faults.  Both  LANC  and 
BLANC  have  smaller  CPU-time  thaa  SUBIT.  Tbs  number  of  page  faults  occuriag  ia  LANC  is 
greater  thaa  that  for  SUBIT.  With  a  proper  choice  of  block  sixe  BLANC  will  remain  at  the  same 
low  level  for  the  CPU-time  as  simple  LANC,  aad  at  the  same  time  have  fewer  page  faults  thaa 


SUBIT. 
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At  preseat  we  are  experimental  with  different  ways  of  adapting  ANALZT  to  block  tri diag¬ 
onal  matrices. 
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Appendix  A-2 

The  megallop  rates  that  have  been  plotted  in  the  diagram  in  Appendix  A-l,  were  obtained 
by  the  following  program.  Note  that  for  each  n  500  vector  operations  (dot  product,  Sehnr  pro¬ 
duct,  and  SAXPY)  were  measured  by  die  SECOND  function. 

PROGRAM  LOOP  (TAPE6— OUTPUT) 

DIMENSION  X(32768),  Y(327«8),  Z(32708),  T(3),  P(3),  TT(4) 

X(l;32768)  —  1. 

Y(l;32768)  -  2. 

N-  32768 
C 

100  T(l;6)  -  0. 

TT(1) »  SECOND() 

DO  300 1  -  1,  500 

A  -  Q8SDOT  (X(1;N),Y(1;N)) 

300  CONTINUE 

TT(2)  *  SECOND() 

DO  400  I  -  1,  500 

Z(1'»N)  »  X(1;N)  *  Y(1;N) 

400  CONTINUE 

TT(3)  -  SECOND() 

DO  500  I  -  1,  500 

Z(1;N)  -  A*X(1;N)  +  Y(1;N) 

500  CONTINUE 

TT(4)  -  SECONDf) 

DO  600  J  —  1,  3 

T(J)-TT(J+1)-TT(J) 

600  CONTINUE 

P(l;3)  —  5.E-4  •  N  /  T(l;3) 

WRITE  (6,1000)  N,T,P 

1000  FORMAT  (IX,2HN-,15,3F10.6,3F6.2) 

N-N/2 

IF  (N.GT.l)  GOTO  100 
C 

STOP 

END 
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Lanczos  Method  In  order  to  take  advantage  of  the  special  vector  processor  of  the 
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parisons  support  the  following  general  statements.  Both  methods  require  the 
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inates  the  total  computation  as  n  -*■  ®  provided  that  the  number  of  wanted  eigen- 
pairs,  p,  remains  fixed  (independent  of  n).  However,  simple  Lanczos  is  at  least 
an  order  of  magnitude  more  eifficient  (in  CPU-time)  for  the  remainder  of  the 
computation.  For  p*40,  n=500  the  factorization  time  is  not  important  and  the 
full  order  of  magnitude  difference  is  seen  in  the  total  CPU-time.  When  p=40, 
n=8000  simple  Lanczos  is  only  4  times  faster  than  Subspace  Iteration  on  the  CYBEF 
205.  This  confirms  experience  on  serial  computers. 

For  problems  that  cannot  fit  into  primary  storage,  input-output  becomes  increas¬ 
ingly  important.  We  found  that  the  cost  of  input/output  dominated  ober  the  CPU- 
cost  for  a  problem  that  required  twice  the  available  primary  storage  on  our  CYBER 
205.  However,  this  will  depend  on  the  billing  algorithm^ the  computer  center. 

We  conclude  that  problems  which  have  a  substantial  overhead  in  reading  and  wrltin 
the  matrices,  should  not  be  solved  by  the  simple  Lanczos  Method,  but  by  a  Block 
Lanczos  Method. 
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